Layout Language Model

Layout language models were inspired by the BERT model, in which input text is represented by token embeddings and position embeddings.

LayoutLM further adds two types of input embeddings:

  1. a 2-D position embedding that denotes the relative position of a token within a document;
  2. an image embedding for scanned token images within a document.

LayoutLM is the first model in which text and layout are jointly learned in a single framework for document-level pre-training.

LayoutLM is a simple but effective multi-modal pre-training method of text, layout, and image for visually-rich document understanding and information extraction tasks, such as form understanding and receipt understanding.

There are several versions of LayoutLM, namely LayoutLM, LayoutLMv2, and LayoutLMv3, and all of them outperform the previous state-of-the-art (SOTA) results on multiple datasets.

We can use the LayoutLM family of models from the transformers Python library.

We can use the LayoutLMv3FeatureExtractor module to extract features from documents. The returned features consist of three components:

  1. Words - the text content of the document
  2. Boxes - the coordinates of the text bounding boxes
  3. Pixel Values - the pixel values of the document image

Under the hood, the feature extractor uses the Tesseract engine for OCR, and the models themselves run on TensorFlow or PyTorch.

Let's check the architecture of the Layout Language Model.

(Architecture diagram of the LayoutLM model)

Let's try to run the model.

Install the following requirements.
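A plausible requirements.txt for this walkthrough might look like the following (the exact package set is an assumption based on the libraries used below):

transformers
torch
pytesseract
paddleocr
paddlepaddle
Pillow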

Install the Tesseract engine on the system.

The following command is needed if we want to use the Tesseract engine for Telugu-language predictions.
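On Ubuntu, the Telugu traineddata ships as a separate package (the package name below follows standard Ubuntu naming and is worth verifying for your distribution):

sudo apt-get install tesseract-ocr-tel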

Import the necessary modules.

One of the important modules is transformers, which has built-in implementations of the Layout Language Models.
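The imports for the LayoutLM part might look like this (a sketch; the exact set depends on which cells you run):

```python
# transformers provides the LayoutLM models and feature extractors;
# PIL is used to load and draw on the document images.
from transformers import LayoutLMv3FeatureExtractor
from PIL import Image, ImageDraw
```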

I have stored my files in my Google Drive, so I need to mount the Drive in the Colab notebook.
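Mounting Drive in Colab is a one-liner; the try/except guard below is my addition so the same cell also runs outside Colab:

```python
try:
    # Only available inside a Colab runtime
    from google.colab import drive
    drive.mount("/content/drive")
    mounted = True
except ImportError:
    # Running outside Colab; skip the mount
    mounted = False
print("Drive mounted:", mounted)
```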

With Google Drive mounted, we can read the files directly.

Let's load one image from wantok_images.

Let's check the size of the image.

Let's view the loaded image.
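The loading-and-inspection steps above can be sketched as follows; the file name is a stand-in (we generate a blank page here so the snippet is self-contained, whereas the notebook reads a scan from Drive):

```python
from PIL import Image

# Stand-in for a scanned wantok page; in the notebook this path would
# point at a file under the mounted Drive folder.
image_path = "wantok_page.png"
Image.new("RGB", (640, 480), "white").save(image_path)

image = Image.open(image_path).convert("RGB")
print(image.size)  # (width, height) in pixels
# In a notebook, evaluating `image` (or calling image.show()) displays it.
```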

Now that we have our image, let's extract its features using the LayoutLM feature extractor.

Feature Extraction :
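A minimal sketch of the feature-extraction call is below. With apply_ocr=True (the default), the extractor runs Tesseract and the output also contains words and boxes; OCR is disabled here so the sketch runs without the Tesseract binary:

```python
from transformers import LayoutLMv3FeatureExtractor
from PIL import Image

image = Image.new("RGB", (640, 480), "white")  # stand-in for the wantok scan

# apply_ocr=True would additionally return `words` and `boxes` via Tesseract
feature_extractor = LayoutLMv3FeatureExtractor(apply_ocr=False)
encoding = feature_extractor(image)
print(list(encoding.keys()))
```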

From the above output we can see the predicted words. Let's concatenate all the words to see the entire text.
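The concatenation is just a join; the extractor returns one word list per input image, so we index the first one (the words below are hypothetical):

```python
# Hypothetical OCR output in the extractor's format: one word list per image
words = [["Wantok", "Niuspepa", "Long", "Papua"]]

full_text = " ".join(words[0])
print(full_text)  # Wantok Niuspepa Long Papua
```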

Let's see the predicted OCR text.

From the above, we can see the text extracted from the image.

Now let's plot the bounding boxes around this text.

Bounding Boxes:
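One way to draw the boxes with PIL is sketched below. LayoutLM boxes are normalised to a 0-1000 grid, so they must be rescaled to pixel coordinates first; the image and box values here are stand-ins:

```python
from PIL import Image, ImageDraw

image = Image.new("RGB", (640, 480), "white")       # stand-in page
boxes = [[120, 50, 400, 90], [120, 110, 520, 150]]  # hypothetical 0-1000 boxes

draw = ImageDraw.Draw(image)
width, height = image.size
for x0, y0, x1, y1 in boxes:
    # Rescale from the 0-1000 grid to actual pixel coordinates
    draw.rectangle(
        (x0 * width / 1000, y0 * height / 1000,
         x1 * width / 1000, y1 * height / 1000),
        outline="red",
        width=2,
    )
image.save("layoutlm_boxes.png")
```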

From the above output we can clearly see the boxes around the text.

Now let's look at other ways to extract text from the same image.

PADDLE OCR

PaddleOCR is an open-source library that provides practical, ultra-lightweight pre-trained models and supports training and deployment on servers, mobile, embedded, and IoT devices.

PaddleOCR is mainly designed and trained for Chinese and English character recognition, but the model has also been verified on several other languages, such as French, Korean, Japanese, and German.

As mentioned, it is very lightweight and can be used with or without a GPU. It returns three output components:

  1. Text detection
  2. Detected Boxes
  3. Text recognition

Once all the requirements are installed, let's create the PaddleOCR API object.
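Creating the OCR object is a one-liner; the try/except guard is my addition so the sketch degrades gracefully where PaddleOCR is not installed:

```python
try:
    from paddleocr import PaddleOCR  # pip install paddleocr paddlepaddle

    # use_angle_cls enables the text-angle classifier; lang selects the
    # recognition model ("en" for English)
    ocr = PaddleOCR(use_angle_cls=True, lang="en")
    ocr_ready = True
except ImportError:
    ocr_ready = False  # PaddleOCR is not installed in this environment
print("PaddleOCR ready:", ocr_ready)
```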

Since we are using Google Colab, cv2's display function is not supported directly, so we need to import the appropriate helper (cv2_imshow from google.colab.patches).

Also, as I mentioned, my files are stored in Google Drive, so I need to mount Google Drive.

With Google Drive mounted, we can access the files.

Let's load one image and view it.

We can view the image clearly. Now let's extract the text from it.

We use the OCR object to read the image and extract the text.

Now we have the result from the OCR. Let's print each line of the result.

Now let's see the entire concatenated text.
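PaddleOCR returns, for each detected line, a quadrilateral of box points plus a (text, confidence) pair. The per-line printing and the concatenation can be sketched against a hypothetical result of that shape:

```python
# Hypothetical result in PaddleOCR's per-line output format:
# [box_points, (text, confidence)]
result = [
    [[[10, 10], [200, 10], [200, 40], [10, 40]], ("Wantok", 0.98)],
    [[[10, 50], [220, 50], [220, 80], [10, 80]], ("Niuspepa", 0.95)],
]

# Print each detected line with its confidence score
for box, (text, score) in result:
    print(f"{text} ({score:.2f})")

# Concatenate everything into one string
full_text = " ".join(text for _, (text, _) in result)
print(full_text)  # Wantok Niuspepa
```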

From the above output cell, we can see the extracted text.

Now let's plot the boxes around the text.
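Since PaddleOCR boxes are quadrilaterals in pixel coordinates, they can be traced directly with PIL's line drawing (PaddleOCR also ships a draw_ocr helper; the sketch below avoids that dependency and uses stand-in data):

```python
from PIL import Image, ImageDraw

image = Image.new("RGB", (640, 480), "white")          # stand-in page
boxes = [[[10, 10], [200, 10], [200, 40], [10, 40]]]   # hypothetical quads

draw = ImageDraw.Draw(image)
for box in boxes:
    # Close the polygon by repeating the first point, then trace its outline
    points = [tuple(pt) for pt in box] + [tuple(box[0])]
    draw.line(points, fill="green", width=2)
image.save("paddle_boxes.png")
```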

Let's see the output image.

From the above image we can clearly see the extracted text along with its bounding boxes. In addition, PaddleOCR returns each word with a confidence score indicating how accurate the recognition is.

The next few cells are benchmark runs. I ran PaddleOCR on my resume to check the accuracy of the results.

From the above, we can see that PaddleOCR extracted the text from the resume very accurately.

Let's check another way of extracting text: the Tesseract engine, a well-known OCR engine that many libraries use internally.

Tesseract Model

Tesseract is an Optical Character Recognition engine for various operating systems.

To get access to this engine, we need to install it on our machine using the command appropriate for our system environment.

For Linux/Ubuntu environment:

sudo apt-get install tesseract-ocr

For MacOS environment:

For macOS users, we’ll be using Homebrew to install Tesseract

brew install tesseract

If you just want to install Tesseract without updating any other brew components, use the following command.

HOMEBREW_NO_AUTO_UPDATE=1 brew install tesseract

Once we run the appropriate command for our environment, the Tesseract engine will be available on our machine. We also need the Python library pytesseract to drive the engine, which we already installed through our requirements.txt.

Tesseract uses OCR to capture data from both structured and unstructured documents. It extracts text from images and documents without a text layer, and can output the result as a searchable text file, PDF, or most other popular formats.

Like PaddleOCR, Tesseract is lightweight and can run with or without a GPU. Tesseract also copes reasonably well with blurred or bright backgrounds.

Now import all the necessary modules.

Once we load the data, we can run the pytesseract API directly on the image.
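Running pytesseract on a loaded image is a single call; the guard is my addition so the snippet also runs where the Tesseract binary is missing, and the blank image is a stand-in for the loaded scan:

```python
from PIL import Image

image = Image.new("RGB", (400, 100), "white")  # stand-in for the loaded image

try:
    import pytesseract  # needs the tesseract binary on the system

    # For Telugu pages: pytesseract.image_to_string(image, lang="tel")
    text = pytesseract.image_to_string(image)
except Exception:
    text = None  # pytesseract or the tesseract engine is unavailable
print(text)
```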

From above output cell, we can see the extracted text.

Conclusion:

In this notebook, I extracted text from the wantok images in several different ways. I also worked on one more extraction method, using Google Cloud. That code is in the GitHub repo, and you can find it under Google Cloud API.

Python-script versions of the above methods can also be found in the GitHub repo.

I have also planned two more steps for this project: evaluation metrics and fine-tuning the Layout Language Model. However, I could not get labels for the current dataset, so I cannot do that yet. I plan to generate some ground-truth labels, run an evaluation, and fine-tune the model to get better results.